feat: support embedding and image generation (DRAFT) #85
Conversation
…eInferenceParameters, EmbeddingInferenceParameters
…th BaseInferenceParameters
```python
if response.data and len(response.data) == len(input_texts):
    return [data["embedding"] for data in response.data]
else:
    raise ValueError(f"Expected {len(input_texts)} embeddings, but received {len(response.data)}")
```
There might be an issue if `response.data` is `None`?
Based on the documentation, upon calling `.embedding(...)` we're either going to get an `EmbeddingResponse` object or some exception will be raised. And `EmbeddingResponse.data` is a list... the latter check should kick in if that list is empty, right?
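For reference, the guard under discussion could be written defensively so that `None` and an empty list are handled by the same branch. This is just a sketch following the names in the diff above, not the PR's actual code:

```python
def extract_embeddings(response_data, input_texts):
    """Validate and unpack embeddings from a response.

    `response_data` mirrors `EmbeddingResponse.data`: a list of dicts with an
    "embedding" key, or possibly None if the client ever returns that.
    """
    if not response_data:  # catches both None and []
        raise ValueError(f"Expected {len(input_texts)} embeddings, but received none")
    if len(response_data) != len(input_texts):
        raise ValueError(
            f"Expected {len(input_texts)} embeddings, but received {len(response_data)}"
        )
    return [item["embedding"] for item in response_data]
```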
src/data_designer/engine/dataset_builders/column_wise_builder.py
```python
input_chunks = [chunk.strip() for chunk in input_chunks if chunk.strip()]
embeddings = self.model.generate_text_embeddings(input_texts=input_chunks)
data[self.config.name] = {
    "embeddings": embeddings,
```
My understanding is that these 3 fields are added as JSON in a single column, correct? Considering that `embeddings` is `list[list[float]]` (?), is that an issue? Could it be that the JSON is added as a string and `embeddings` is encoded sub-optimally, truncated, etc.?
> My understanding is that these 3 fields are added as JSON in a single column, correct?
Yes, that's correct. I'll double check what happens when we serialize these as partial results and report back. I think they were serialized correctly without truncation when I ran some tests.
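One quick way to sanity-check the truncation concern is to round-trip a representative cell value through JSON and compare floats exactly. This is a standalone sketch, not the builder's actual code path, and the field names besides `embeddings` are made up for illustration:

```python
import json

# Hypothetical cell value: several fields stored together in one column.
cell = {
    "embeddings": [[0.123456789, -1.0], [3.14159, 2.71828]],  # list[list[float]]
    "chunks": ["first chunk", "second chunk"],
}

serialized = json.dumps(cell)      # the column would store this string
restored = json.loads(serialized)  # reading it back

# json encodes Python floats via repr, so the round trip is exact: no truncation.
assert restored["embeddings"] == cell["embeddings"]
```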
**Specifying Model Type**

In the current example, the nature of the model must be inferred from the provided inference parameter type. E.g. the only way to know that the following model is, in fact, an embedding model is to see that the user provided `EmbeddingInferenceParameters`:

```python
ModelConfig(
    alias="nvidia-embedder",
    model="nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2",
    provider="nvidia",
    inference_parameters=EmbeddingInferenceParameters(
        extra_body={"input_type": "query"},
    ),
)
```

Another natural thing to do would be to create inheritors of `ModelConfig`, e.g.:

```python
EmbeddingModelConfig(
    alias="nvidia-embedder",
    model="nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2",
    provider="nvidia",
    inference_parameters=EmbeddingInferenceParameters(
        extra_body={"input_type": "query"},
    ),
)
```

This would also allow the flexibility later to be able to specify …

**Chunking As A Separate Action**

Presently we have some patterns for specifying chunking, but it seems a bit cumbersome to join the chunking + embedding together into a single step:

```python
config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_nvidia",
        model_alias="nvidia-embedder",
        target_column="questions",
        chunk_pattern="\n+",
    )
)
```

Another option might be to create a separate chunking column, e.g.:

```python
config_builder.add_column(
    ChunkTextColumnConfig(
        name="questions_chunked",
        target_column="question",
        chunk_style="newlines",
    )
)
config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_A",
        model_alias="model_A",
        target_column="questions_chunked",
    )
)
config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_B",
        model_alias="model_B",
        target_column="questions_chunked",
    )
)
```
Another advantage of the separate chunking pattern is that we can use it in other contexts, e.g. if we want to extract a random chunk from a document.
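A standalone chunking step could indeed be reused that way. For instance, a random-chunk extraction might look like the following sketch, where `chunk_text` is a hypothetical helper mirroring the strip/filter logic in the diff above:

```python
import random
import re

def chunk_text(text, pattern=r"\n+"):
    # Split on the regex pattern and drop empty/whitespace-only chunks,
    # mirroring the strip/filter logic in the embedding generator diff.
    return [c.strip() for c in re.split(pattern, text) if c.strip()]

doc = "first paragraph\n\nsecond paragraph\nthird paragraph"
chunks = chunk_text(doc)
random_chunk = random.choice(chunks)  # e.g. sample one chunk per document
```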
@eric-tramel thanks for the feedback!
IMO, I initially added …
This is a good call out... something I was playing around with. The main goal of this generator is to allow generation of multiple embeddings per cell. Allowing a way to split the content of the target column via a regex pattern within embedding generation was something I thought was generic, but I totally see the simplicity of decoupling these two. In the example you provided, after we use a different generator to do chunking, we still need a way to tell the embedding generator how to discover these different chunks... unless we explicitly say that …
Gotcha -- but then how is this going to be handled at the CLI when inputting model config parameters? Will the user need to specify the kind of model at that point, too?
Yep, in this case an …

It would be interesting in the future to consider the option of some kind of explode feature, like in pandas/polars. This would be entirely optional, but can allow a user to flatten the nesting of their dataset if that's not what they want. For instance, the below pattern would allow one to create a dataset of "all embedding chunks" from a set of source documents without needing to unpack or fiddle with nesting structures themselves.

```python
# Start with N documents in the dataset, now chunk.
config_builder.add_column(
    ChunkTextColumnConfig(
        name="document_chunked",
        target_column="document",
        chunk_params={"style": "contiguous", "max_chars": 4096},
    )
)

"""
Now, assume M chunks / document. We now have N rows where each row contains
    document_chunked_row_0 = [chunk_0_0, ..., chunk_0_M]
    ...
    document_chunked_row_N = [chunk_N_0, ..., chunk_N_M]

Next, let's say we want to operate on each chunk independently for the rest
of the workflow.
"""
config_builder.explode(name="document_chunked")

"""
Now, we have N*M rows in our dataset:
    document_chunked_row_0 = chunk_0_0
    document_chunked_row_1 = chunk_0_1
    ...
    document_chunked_row_M = chunk_0_M
    ...
    document_chunked_row_NM = chunk_N_M

And perhaps we want to do our embedding now.
"""
config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_A",
        model_alias="model_A",
        target_column="document_chunked",
    )
)
```
Right, something like that. I hadn't actually thought about the CLI, so thanks for raising! It might make sense to hang …

Updated in c6c29d4
The need to support generation of multiple embeddings per row in a single generator is exactly because within the NDDL workflow we don't yet have a way to explode and reduce rows. Until that becomes a reality, keeping the embedding generator simpler (operating on strings or lists of strings) without worrying about chunking should suffice and is generic enough. We can add chunking support in a different PR. Let me update this PR and incorporate your suggestions!

I removed the chunking param/logic in 06a724b. The embedding generator now expects the column it targets to contain a string or a stringified JSON list of strings.
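The "string or stringified JSON list of strings" contract could be resolved with a small normalization step along these lines. This is only a sketch of the idea, not the generator's actual code; `resolve_target_texts` is a hypothetical helper name:

```python
import json

def resolve_target_texts(cell_value: str) -> list[str]:
    """Interpret a cell as either a plain string or a stringified JSON list of strings."""
    try:
        parsed = json.loads(cell_value)
    except (json.JSONDecodeError, TypeError):
        return [cell_value]  # not JSON at all: treat the whole cell as one text
    if isinstance(parsed, list) and all(isinstance(x, str) for x in parsed):
        return parsed
    return [cell_value]  # valid JSON but not a list of strings (e.g. a bare number)
```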
…lved based on the type of InferenceParameters
Closing this PR in favor of #106
Opening a draft PR for initial feedback. I currently have the modality extended to support embedding generation, but we can follow the same pattern to support image generation.
Major change is the need to break out `InferenceParameters` into generation-type-specific ones. Changes include renaming the existing `InferenceParameters` -> `CompletionInferenceParameters` with backwards compatibility + a deprecation warning.

I'm working on expanding to image generation in the same PR and will mark this as ready for review when that's done.
Here's an example of what the workflow looks like for embeddings
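The example itself did not survive the page capture. Based on the config types used elsewhere in this thread (`ModelConfig`, `EmbeddingInferenceParameters`, `EmbeddingColumnConfig`), a plausible reconstruction might look like the following; the column and alias names are illustrative only:

```python
# Model definition (as shown earlier in this thread):
embedder = ModelConfig(
    alias="nvidia-embedder",
    model="nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2",
    provider="nvidia",
    inference_parameters=EmbeddingInferenceParameters(
        extra_body={"input_type": "query"},
    ),
)

# Embedding column targeting an existing text column:
config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_nvidia",
        model_alias="nvidia-embedder",
        target_column="questions",
    )
)
```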